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Abstract 

Let t be a permutation on [l..n] and a pattern p be a series of m distinct integer(s) of [l..n], 
m < n. The pattern p occurs in t in position i if and only if p\ . . .p m is order-isomorphic to 
ti . . . ti +m -i, that is, for all 1 < k < I < m, pu > Pe if and only if U + k-i > U+t-i- Searching a 
pattern p in a text t consists in identifying all occurrences of p in t. We first present a forward 
automaton which allows us to search for p in t in (0(m 2 log log m + n)) time. We then introduce 
a Morris-Pratt automaton representation of the forward automaton which allows us to reduce 
this complexity to (0(mloglogm + n)) at the price of an additional amortized constant term 
by integer of the text. Both automata occupy 0{m) space. We then extend the problem to 
search for a set of patterns and exhibit a specific Aho-Corasick like algorithm. Next we present 
a sub- linear average case search algorithm running in 0( 1 T o " g 1 1 ° o s g " + m " 1 °f " m ) . time. 

1 Introduction 

Let t be a permutation on [l..n] and a pattern p be a series of m distinct integer(s) of [L.n], m <n. 
The pattern p occurs in t in position i if and only if p\ . . .p m is order-isomorphic to U . . . ti+ m -i, 
that is, for all 1 < k < I < m, Pk > Pe if and only if tj+fc-i > ti+£-\. Searching a pattern p in a 
text t consists in identifying all occurrences of p in t. 

By example, p = (1, 8, 5, 6) and p' = (3, 127, 12, 56) are order-isomorphic, while p = (1, 8, 5, 6) 
and p' = (3, 127, 12, 7) are not. Also, the ending positions of the two occurrences of p = (1, 5, 2) in 
t = (1, 4, 2, 5, 3) are 2 and 5. The pattern p is usually named a consecutive motif. 

In this paper we first present a forward automaton which allows us to search for p in t in 
(0(m 2 log log m + n)) time. We then introduce a Morris-Pratt automaton representation [6] of the 
forward automaton which allows us to reduce this complexity to (0(mloglogm + n)) at the price 
of an additional amortized constant term by integer of the text. Both automata occupy 0(m) 
space. We then extend the problem to search for a set of patterns and exhibit a specific Aho- 
Corasick like algorithm. Eventually, we present a sub-linear average case search algorithm running 
in 0(n log mj log log m) time. 

Let us define some notations. The set of permutations on [l..n] is denoted cr(n) and its size 
|<r(n)| = n\. Let p G o~(n) and let us consider it as a string without symbol repetition over the 
alphabet [l..n]. We denote the set of strings without symbol repetition we can obtain by picking 
between and n integer(s) in [l..n] by a* . A prefix (resp. suffix, factor) u of p is a string such that 
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p = uw,w G a*, (resp. p = wu,w G a*, p = wuz,w,z G a*. We also denote |ui| the number of 
integer(s) in a string w, w G a* . We eventually denote p r the reverse of p, that is, the string formed 
by the symbols of p read in the reverse order. 

In the remainder of the article we denote p = the set of words of a* which are order-isomorphic 
to p. The following property is useful in order to design automaton transitions. 

Property 1 Let p = p\. . .p m G a* and w = w\ . . . we, £ < m,a* such that w is order-isomorphic 
to p\. . .pi, and let a G S. Testing if wa is order-isomorphic to p\ . . .p^pe+i can be performed in 
constant time storing only a pair of integers. 

Proof. The pair of integers {x\,X2) is determined in the following way: x\ < i is the greatest 
number such that p Xl is the position of one of the largest integer in p\..pe which is smaller than 
Pe+i, if any. Otherwise, we fix x\ arbitrarily to — oo. Symmetrically, x<i < £ is the greatest position 
of one of the smallest integer in p\..pe which is larger than pe+\, if any. Otherwise, we fix X2 to 
+00. Now, it suffices to test if w Xl < a < w X2 to verify if wa is order-isomorphic to p\ . . .pg+i- □ 

We define a function iep(p = pi . . . p m ,j) which returns a pair of integers (xi, X2) that represents 
the pair defined in property [T] for the prefix of length j of a motif p. 

2 Tools 

Before proceeding, we first describe some useful data structures we use as basic subroutines of 
our algorithms. The problem called predecessor search problem is defined as follows: given a 
set S = {x±, X2, ■ ■ . x n } C [l..u] (u is called the size of the universe), we support the following 
query: given an integer y return its predecessor in the set S, namely the only element Xi such that 
x% < y < Xi+i Q In addition, in the dynamic case, we also support updates: add or remove an 
element from the set S. 

The standard data structures to solve the predecessor search are the Balanced Binary search 
trees [HH]. They use linear space and support queries and updates in worst-case O(logn) time. 
However, there exists better data structures that take advantage of the structure of the integers 
to get better query and update time. Specifically, the Van-Emde-Boas tree [8] supports queries 
and updates in (worst-case) time O(loglogu) using 0(u) space. Using randomization, the y-fast 
trie achieves linear space with queries supported in time O (log log u) and updates supported in 
randomized O (log log u) time. The problem has received series of improvements which culminated 
with Andersson and Thorup's result pjj. They achieve linear space with queries and updates 

supported in 0(min(loglogn, y ^giogn )) (the update time is still randomized). 

A special case occurs when space n is available and the set of keys S is known to be smaller 
than log c n for some constant c. In this case all operations are supported in worst-case constant 
time using the atomic- heap [9j . 

3 Forward search automaton 

The problem we consider is to search for a motif p in a permutation t without preprocessing the 
text itself. By analogy to the simpler case of the direct search of a word p in text t, we build an 

By convention, if all the elements of S are smaller than y, then return —00 and if they are larger than y then 
return x n 
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automaton that recognizes a*p~ . We then prove its size to be linear in the length of the pattern. 
We formally define our forward search automaton J r D(p) built on p = p\ . . .p m as follows: 

• m + 1 states corresponding to each prefix (including the empty prefix) of p, state is initial, 
state m is terminal; 

• m forward transitions from state j to j + 1 labelled by rep(p, j + 1); 

• bt backward transitions S(x, where x numbers a state, < x < m, i £ 1, ... ,xU — oo, 
j € 1, . . . ,x U +oo, defined the following way: 5(x, = q if and only if for all pi < a < pj 
(resp. k = a < pj if i = — oo, pi < a if j = +oo), the longest prefix of p that is order- 
isomorphic to a suffix of pi . . . p x a is p\ . . . p q . 

We also impose some constraints on outgoing transitions. Let x be a given state corresponding to 
the prefix p\. . .p x . Let us sort all pi, 1 < i < x and consider the resulting order pi = — oo < p^ < 
■ ■ ■ < Pi k < +oo = Pi k+1 - We build one outgoing transition for each interval \py, Pu +1 ], excepted if 
Pij+i = Pi, + 1- Also we merge transitions from the same state to the same state that are labelled 
by consecutive intervals. 

It is obvious that the resulting automaton recognizes a given pattern in a permutation by 
reading one by one each integer and choose the appropriate transition. Figure [T] shows such an 
automaton. 



[-°o,5] 




[-00.3] 



Figure 1: Forward automaton built on p = (4,12,6,16,10). State is initial while state 5 is 
terminal. 

The main result on the structure of the forward automaton is the following. 

Lemma 1 The number of transitions of the forward automaton built on p\. . .p m is linear in m. 

Proof. Point 1. We adapt the technique of [7J to our framework. Let q = S(x, a backward 
transition from x to q such that q > 2. Then p\ . . .p q -\ is order-isomorphic to the suffix of p\ . . .p x 
of length q — 1. But either (a) p\ . . .p q is not order-isomorphic with p\ . . .p x , or (b) x = m (x is 
the last state of the automaton. Let £ = x — q. We prove now a contrario that no other backward 
transition q' = 5(x', such that q' > 2 can accept the same difference £' = x' — q' = I. Let 

q' = b~{x\ [i 1 ' be such a transition and consider without lost of generality that 2 < q' < q. Then 
pi . . -Pq'-i would be order-isomorphic to the suffix of p± . . .p x < of length q —' 1, and p\ . . .p q i must 
not be order-isomorphic to p\ . . .p x >p x > + i. However, as 2 < q' < q, pi . . .p q / is a prefix of p\ . . -Pq-i, 
and as l' = I', p\ . . -p q ~i is order-isomorphic to the prefix of p\ . . .p x of length q' , which is exactly 
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Pi ■ ■ -Px'Px'+i- This leads to a contradiction and for a given 1 < I < m, there exists at most one 
backward transition q = 5(x, [i,j]),q > 2 such that x — q = £. This bounds the number of such 
backward transition to m — 2. Let N(x) be the number of backward transitions q = 5(x, from 
x such that q > 2. 

Point 2. We consider now all backward transitions 1 = 5(x, reaching state 1. We denote 
such a transition a 1-transition. Note that state is never reached by any transition because any 
two integers are always order-isomorphic. The key observation is that from each state x source of 
the transition, the number of such 1-transitions from x is bounded by N(x) + 2. This is true since 
1-transitions and other transitions must be interleaved to cover [— oo, +oo]. Therefore, as the total 
number of N(x) is bounded by m — 2, the number of 1-transitions is bounded by 2m — 4. 
Point 3. The number of forward transitions is m + 1, thus the whole number of transitions is 
bounded by 4m — 5. □ 

Lemma [T] combined with the fact that the outgoing transitions from each state q are sorted 
accordingly to the closest proximity to q of their arrival state leads to the following lemma. 

Lemma 2 Searching for a consecutive motif p = p\ . . . p m in a permutation t = t\ . . .t n using a 
forward automaton built on p takes 0(n) time. 

Proof. Searching for p in t using the forward automaton of p can be easily done reading all symbols 
of the text one after the other. But at each state one must identify the right outgoing transition, 
which normally requires to search in a list or an AVL tree. This would add a polylog factor to 
all integer reading and thus the complexity would be of the form (9(n.polylog(m)). However, the 
structure of the forward automaton combined with the fact that we imposed all outgoing transitions 
of each node to be sorted increasingly to the length of the transition allow us to amortize the search 
complexity of the searching phase along the permutation. The resulting search phase complexity is 
0(n) time. Indeed, let us search t through the automaton, reading one symbol at a time reaching 
a current state x. Let us assume we read the text until position i and we want to match U+i. We 
test if ij_t_i belongs to the interval [i,j] labeling x + 1 = 5(x, [i,j]) if x < m. If yes, we follow this 
forward transition. If not, we test each backward transition from x in increasing length order. 



Figure 2: Amortized complexity of the forward search. The search starts again from q. On this 
instance I = 3 and q + 3 < x. 

The important point to notice is that after having identified the right backward transition from 
x for reaching state q (there must be one), the search for ij + 2 starts from q < x. Moreover, we 
associate all I transitions q^ = <5fc(x, [i, j]) touched before finding the right one to its ending state 
which verifies q < qk < x. Thus q + £ < x. This point is illustrated in Figure [2] As the search starts 
again from q and that at most one forward transition is passed through by text symbol, the total 
number of forward and backward transitions touched or passed through when reading the whole 
text t = ti . . . t n is thus bounded by 2n. □ 




q qi 



q 2 q 3 x 



new search from q 
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We can build the forward automation in 0(m 2 log log m) time. However, we defer the proof of 
this construction for the following reason. This 0(m 2 log log m) complexity might be too large for 
long patterns. Nevertheless, we show below that we can compute in a first step a type of Morris- 
Pratt coding of this automaton which can either (a) be directly used for the search for the pattern 
in the text and will preserve the linear time complexity at the cost of an amortized constant term 
by text symbol, or (b) be developed to build the whole forward automaton structure. 

We therefore present and build a new automaton MP that is a Morris-Pratt representation of 
the forward automaton. The idea is to avoid building all backward transitions by only considering 
a special backward single transition from each state x, x > named failure transition. We formally 
define our automaton MP{p) built on p = p% . . . p m the following way: 

• m + 1 states corresponding to each prefix (including the empty prefix) of p, state is initial, 
state m is terminal; 

• m forward transitions from state j to j + 1 labelled by rep(p, j + 1); 

• m failure transitions (non labelled) defined by: a failure transition connects a state j > to 
a state k < j if and only if pi . . . p^ is the largest order- isomorphic border of p± . . . pj . 

Figure [3] shows such an MP automaton. 




Figure 3: MP automaton built onp= (4, 12, 6, 16, 10). State is initial while state 5 is terminal. 
Backward transitions are failure transitions. 

Reading a text t through the MP representation of the forward automaton is performed the 
following way. Let us assume we reached state x < m and we read a symbol ti at position i of the 
text. Let [A;, £] = rep(p, x + 1). If ti S [tj_ m+ /%, ij_ m+ z] we follow the forward transition and the new 
current state is x + 1. Otherwise, we fail reading ti from x and we retry from state q = fail(x) and 
so-on until (a) either q is undefined, in which case we start again from state 0, either (b) a forward 
transition from q to q + 1 works, in which case the next current state is g+ 1. 

Lemma 3 Searching for a pattern p in a text t\ . . .t m using the Morris-Pratt representation of the 
forward automaton built on p is 0{n) time. 

In order to prove lemma [3] we need to focus on the classical notion of border that we extend to our 
framework. 

Definition 1 Let p G a*. A border of p is a word wo~*,\w\ < \p\ that is order-is omorphic to a 
suffix of p but also order-isomorphic to a prefix of p. 

The construction of the forward automation relies of the maximal border of each prefix that is 
followed by an appropriate integer in the pattern. The Morris-Pratt approach is based on the 
following property: 

Property 2 A border of a border is a border. 
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This property allows us to replace the direct transition of the forward algorithm by a search along 
the borders, from the longest to the smallest, to identify the longest one that is followed by the 
appropriate integer. 

Proof. [Proof of lemma [3]. Exactly as in the case of a classical text, we amortize the complexity 
of the search over the number of transitions we pass through and the number of reinitialisations 
of the search we do if no more failure transition is available. Each time we pass through a failure 
transition, we decrease the state from where we will go on the search if the state is validated. Thus, 
there can be at most as many failure transitions passed through during the whole reading of the 
text as the number of forward transitions that has been passed through. Since this number is at 
most the size of the text, the total number of transitions touched is at most 2n. Then, if after a 
descent from failure transition to failure transition no more outgoing transition exists, we reinitialise 
the search to state 1. Thus there are at most n such reinitialisations and the total complexity of 
transitions and states touched is bounded by 3n. □ 

We prove now that we can build the Morris-Pratt representation of the forward automaton 
efficiently. 

Lemma 4 Building an Morris-Pratt representation of the forward automaton on a consecutive 
motif p = pi . . .p m can be performed in (worst-case) O(mloglogm) time. 

Proof. Before processing, the pattern we first reduce the range of the keys from [l..n] to [l..m]. 
This is done in deterministic O(mloglogm) times by first sorting the keys using the fastest integer 
sorting algorithm due to Han [5], and then replacing each key by its rank obtained from the sorting. 

We then process the pattern in left-to-right in m steps and at each step j determine the failure 
and forward transitions outgoing of state j. We use two predecessor data structures that require 
0{m) words of space and support insert, delete and query operations (a query operation returns 
both the predecessor and the successor) in (worst-case) time O(loglogm). As we move forward in 
the pattern, we insert each symbol in both predecessor data structures (except for the first symbol 
which is only inserted in the first predecessor data structure). The difference between the two 
predecessor data structures is that the first one will only get insertions while the second one can 
also get deletions. The first is used to determine forward transitions while the second one is used 
to determine failure transitions. 

We now show how we determine the transitions at each step j . The forward transitions connect- 
ing state j to state j + 1 is labelled by rep(p, j + 1) . The latter is determined by doing a predecessor 
search for Pj+i on the first predecessor data structure. This gives us both the predecessor and 
successor of Pj+i among p\ ■ ■ - Pj which is exactly rep(p, j + 1). 

The failure transition is determined in the following way. If the target state of the failure 
transitions of state j — 1 is state i. Then we do a predecessor query on the the second predecessor 
data structure. If the pair of returned prefixes is precisely rep(p, i + 1), then we can make i + 1 as 
a target for state j. Otherwise we take the failure transition of state j — 1. If that transitions leads 
to a state k, then we remove the symbols pj-i..pj^k from the second predecessor data structure. □ 

Lemma [3] and 0] allow us to state the main theorem of this section. 

Theorem 1 Searching for a consecutive motif p = p\ . . . p m in a permutation t = t\ . . . t n can be 
done in 0(?n log log m + n) time. 
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The Morris-Pratt representation of the forward automaton permits to search directly in the text 
at the price of larger amortized complexity (considering the constant hidden by the O notation) 
than that required by searching with the forward automaton directly. If the real time cost of the 
search phase is an issue, the forward automaton can be built form its Morris-Pratt representation 
as follows. 

Property 3 Building the forward automaton of a consecutive motif p = p\ . . .p m can be performed 
in 0(m 2 log log m) time. 

Proof. We first build the Morris-Pratt representation in 0{m log log m) time. We then consider 
each state x > corresponding to the p\ . . . p x from left to right and for each such state we 
expand its backward transitions. Let us sort all pi, 1 < i < x and consider the resulting order 
Pi = —oo < pi x < . . . < pi k < +oo = Pi k+1 ■ We build one outgoing transition for each interval 
[PipPij+iL excepted if Pi j+1 = Pi 3 + 1. This transition is computed as follows. Let q be the image 
state of the failure transition from x. We pick a value z in [pj , an search for z from q. Let q' 
be the new state reached. We create a backward transition form x to q' labelled \pij,Pi j+1 ]- After 
this process we created at most m 2 edges in at most 0(m 2 log log m) time. 

We now merge backward transitions from the same state to the same state that are labelled 
by consecutive intervals. This required at most 0{m 2 ) time. The whole algorithm thus requires 
0{m 2 log log m) time. □ 

An interesting point is that the construction of the forward automaton from its Morris-Pratt 
representation can also be performed in a lazy way, that is, when reading the text. The missing 
transitions are then built on the fly when needed. 

4 Multiple worst case linear motif searching 

We can extend the previous problem defined for a single pattern to a set of patterns S. We note 
by d the number of patterns, by m the total length of the patterns and by r the length of the 
longest pattern. For this problem we adapt the Aho-Corasick automaton [2] (or AC automaton 
for short). The AC automaton is a generalization of the MP automaton to a set of multiple 
patterns. We note by P the set of prefixes of strings in S. In order to simplify the description we 
will assume that the set of patterns S is prefix- free. That is, we will assume that no pattern is 
prefix of another. Extending the algorithm to the case where S is non-prefix free, should not pose 
any particular issue. The states of the AC automaton are defined in the same way as in the MP 
automaton. Each state t in the AC automaton corresponds uniquely to a string p £ P. The forward 
transitions are defined as follows: there exists a forward transition connecting state s to each state 
corresponding to an element pc £ P (where c is a single symbol) . Thus this definition of the forward 
transitions matches essentially the definition of the forward transitions in the MP automaton. The 
failure transitions are defined as follows: a failure transition a state s corresponding a string p to 
the state s' corresponding to the longest string q such that q £ P and q ^ p. The matching using 
the AC automaton is done in the same way as in the MP automaton using the forward and failure 
transitions. 
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4.1 Our extension of the AC automaton 



We could use exactly the same algorithm as the one used previously for our variant of the MP 
automaton with few differences. We describe our modification to AC automaton to adapt it to the 
case of consecutive permutation matching. 

An important observation is that we could have two or more elements of P that are both of the 
same length and order-isomorphic. Those two elements should have a single corresponding state in 
the AC automaton. 

Thus, if two or more elements of P are order-isomorphic then we keep only one of them. 

For the forward transitions, we can a associate a pair of positions (0:1,0:2) to each forward 
transition. Then we can check which transition is the right one by checking the condition ij_ m + Xl < 
ti < ti- m+X2 for every pair (0:1,0:2) and take the corresponding transition. The main problem with 
this approach is that the time taken would grow to 0(d) time to determine which transition to 
take which can lead to a large complexity if d is very large. Our approach will instead be based on 
using a binary search tree (or more sophisticated predecessor data structure). With the use of a 
binary search tree, we can achieve O(logm) time to decide which transition to take. More precisely, 
each time we read T[i] we insert the pair (ti,i) into the binary search tree. The insertion uses the 
number ti as a key. Now suppose that we only pass through forward transitions. Then a transition 
at step i is uniquely determined by: 

1. the current state s corresponding to an element p £ P. 

2. the position of the predecessor of t, among • • . ti— i- 

In order to be able to determine the predecessor of ti among • • . ti— 1, the binary search 
tree should contain precisely the \p\ pairs corresponding to ii-ipi] . . . t»— 1. If the predecessor of ti 
in the binary search tree is a pair (tj,j), we then conclude that the element p[\p\ — j + 1] is the 
predecessor of ti in p. 

In order to maintain the binary search tree we must do the following actions during passing 
through a failure or a forward transition: 

1. whenever we pass through a forward transition at a step i we insert the pair (ti,i). 

2. whenever we pass through a failure transition from a state corresponding to a prefix p\ to a 
state corresponding to a prefix P2, then we should remove from the binary tree all the pairs 
corresponding to the symbols ii-| pi | • • • *t-[pa|- 

It should be noted that each removal or insertion of a pair into the binary search tree takes 
O(logr) time. The upper bound O(logr) comes from the fact that we never insert more than 
r elements in the binary search tree. Since in overall we are doing 0(n) insertions or removals, 
the amortized time should simplify to O(nlogr). Finally if we replace binary search tree with 
a more efficient predecessor data structure, we will be able to achieve randomized time 0(n ■ t) 



data structure (see section [2] for details) . We use the linear space version of the predecessor data 
structure which guarantees only randomized performance but uses 0(r) additional space only. We 
thus have the following theorem : 



where 




needed to do an operation on the predecessor 
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Theorem 2 Searching for set of d consecutive motifs of maximal length r and whose AC automa- 
ton has been built and where the longest pattern is of length r can be done in randomized 0(nt) 



4.2 Preprocessing 

We now show that the preprocessing phase can be done in worst-case 0(m log log r) time. As before 
our starting point will be to sort all the patterns and reduce the range of symbols of each pattern 
of length I from range [l..n] to the range [1..^]. This takes worst-case time 0{m log log r). 

Recall that two or more elements of P of the same length and order-isomorphic should be 
associated with the same state in the AC automaton. 

In order to identify the order-isomorphic elements of P, we will carry a first step called nor- 
malization. It consists in normalizing each pattern. A pattern p is normalized by replacing each 
symbol pj by the pair rep(p = p\ . . .pj-%,j) (consisting in the positions of the predecessor and 
successor among symbols p\ . . .pj-i). This can be done for all patterns in total O(mloglogr) time. 
In the next step, we build a trie on the set of normalized patterns. This takes linear time. The trie 
naturally determines the forward transitions. More precisely any node in the trie will represent a 
state of the automaton and the the labelled trie transitions will represent follow transitions. 

Note that unlike the forward automaton (or the MP automaton) there could be more than one 
outgoing forward transition from each node. 

In order to encode the outgoing transition from each node, we will make use of a hash table 
that stores all the transitions outgoing from that node. More precisely for each transition labelled 
by the pair rep(p = p\ . . .pj-%,j) and directed to a state q, the hash table will associate the key p\ 
associated with the value q. Now that the next transitions have been successfully built, the final 
step will be to build the failure transitions and this takes more effort. 

In order to build the failure transitions we decompose the trie into r layers. The first layer 
consists in the nodes of the trie that represent prefixes of length 1. The second layer consist in all 
the nodes that represent prefixes of length 2 etc.. 

Next, we will reuse the same algorithm that was used in [4] to build the A4P automaton but 
adapted to work on the AC automaton. 

Instead of using a single predecessor data structure we will use multiple predecessor data struc- 
tures and attach a pointer to a predecessor data structure at each trie node. A node of the original 
non compacted trie will share the same predecessor with its parent, iff it is the only child of its 
parent. 

The following building phases will no longer reuse the normalized patterns, but instead reuse 
the original patterns. To each node, we attach a pointer to one of the original pattern. More 
precisely if a node has a single child, then his pattern pointer will be the same as its (only) child 
pattern pointer. If a node has more than one child (in which case it is called a branching node), 
then it will point to the shortest pattern in its subtree. If a node is a leaf then it will directly 
point to the corresponding pattern. A predecessor data structure of a node whose pattern pointer 
points to a pattern of length u will have capacity to hold u keys from universe u and thus will 
use 0(u) space. This is justified by the fact that the predecessor data structure will only hold at 
most u elements of the patterns and each element value is at most u (recall that the pattern is a 
permutation of length u). 

In order to bound the total number of predecessor data structures and their total size, we 



time, where t = min(loglogn, 




d). 



9 



consider a compacted version of the trie (Patricia trie), where each node with a single child is 
merged with that single child. A node in the original (non-compacted) trie with two of more 
children is called branching node. It is clear that the set of nodes of a patricia (compacted) trie are 
precisely the branching nodes and the leaves of the original trie. 

It is a well known fact that a Patricia trie with r leaves has at most 2r — 1 nodes in total. 
Thus the total number of predecessor data structures will be upper bounded by 2r — 1. During the 
building if a node at layer t has a single child, then that single child at level t + 1 will inherit the 
predecessor data structure of its parent. Otherwise if the node v at level t has two or more children 
at level t + 1, then a predecessor data structure is created for each child u. Then if the predecessor 
data structure of v contains exactly k elements, those elements are precisely Xt+i-k ■ ■ ■ %t, where x 
is string pointed by v. We will insert the k elements yt+i-k ■ ■ ■ Vt m to the predecessor data structure 
of u, where y the string pointed by u. 

In order to bound the total space used by the predecessor data structure, we notice that the 
total space used by the predecessor data structures of the leaves will be no more than the total 
length of the patterns that correspond to the leaves which is 0{m). 

In order to bound the total space used by the predecessor data structures, we notice that the 
total capacities of all predecessor data structures is 0(m). This can easily be proved. Because 
we know that the total length of all patterns is bounded by m, we will also know that the total 
cumulative length of all strings pointed by branching node is also upper bounded by m. This is 
because precisely the pointed strings are precisely the shortest strings in the subtrees rooted by 
the branching node. The same holds for the leaves as the capacities of their respective predecessor 
data structures will be no more than the total length of the patterns that correspond to the leaves 
which is 0{m). 

We finally need to bound the total construction time which is dominated by the operations 
on the predecessor data structures. The time is clearly bounded by 0(m log log r). This is by a 
straightforward argument: as the total sum of the pointed strings is O(m), and we know that each 
element of a pointed string can only be inserted or deleted once, and furthermore each insert /delete 
cost precisely O (log log r) worst-case time, we conclude that the total time spent in the predecessor 
data structure is worst case 0(m log log r). We thus have the following theorem: 

Theorem 3 Building the AC automaton for a set of d consecutive motifs of total length m and 
where the longest motif is of length r can be done in worst-case 0(m log log r) time. 

5 Single sublinear average-case motif searching 

Algorithm forward takes 0(n + m log log m) time in the worst case time but also on average. We 
present now a very simple and efficient average case-algorithm which takes O ( + n m /og i^ m ) 
time. 

In order to search for a pattern p in t, we first build a tree T of all isomorphic-order factors 
of p r of length ^ }°|; ™ . T is built by inserting each such factor one after the other in a tree and 
building the corresponding path if it does not already exist. The construction of this tree requires 
log log m) time (details are given below). 

The search phase is performed through a window of size m that is shifted along the text. For 
each position of this window, b = ^ }°| ^ symbols are read backward from the end of the window 
in the tree T. Two cases may occurs. 
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• either the factor is not recognized as a factor of p r . This means that no occurrence of p might 
overlap this factor and we can surely shift the search window after the last symbol of this 
factor; 

• either the factor is recognized, in which case we simply check if the motif is present using 
a naive 0(m) algorithm, and we repeat this test for the next 0{m/2) symbols. This might 
require 0(m 2 /2) steps in the worst case. 

Figure [4] illustrates the first case. 

search window of size m 

m/2 5 log m/ fog fog »j j 

safe shift 



Figure 4: First case: the motif w r is not recognized in the tree T, which implies that no occurrence 
of p can overlap w and the search window can surely be shifted after the first symbol of w. 

Let us analyze the average complexity of our algorithm, in a Bernoulli model with equiproba- 
bility of letters, that is, every position in the text and the paper is independent of the others and 
the probability of a symbol to appear is 1/a. We also consider that b < m/2 since we are interested 
in analyzing the average complexity for pattern long enough. 

We count the average number of symbol comparisons required to shift the search window of 
m/2 symbols to the right. As there are 2n/m such segments of length m/2 symbols in n, we will 
simply multiply the resulting complexity by 2n/m to gain the whole average complexity of our 
algorithm. 

There might be 0(b\) distinct motifs that could appear in the text while this number is bounded 
by m — b + 1 in the pattern (one by position). Thus, with a probability bounded by m ^ +1 we 
will recognize the segment of the text as a factor of p and enter case 2. In which case, moving the 
search window of m/2 symbols to the right using the naive algorithm will require 0(m 2 /2) worst 
case time. 

In the other case which occurs with probability at least 1 — m ~^ +1 , shifting the search window 
by m/2 symbols to the right only requires reading b numbers. 

The average complexity (in terms of number of symbol reading and comparisons) for shifting 
by m/2 symbols is thus (upper) bounded by 

A = 0((m 2 /2) + 6(1 )) 

and the whole complexity by 0((2n/m)A). By expanding and simplifying A we get that A = 
0(b + 0(m 3 /26!)). Now using the famous Stirling approximation ln(m!) = mlnm — m + O(lnm), 
it is not difficult to prove that b\ = 2 bl °g fe - bl °g e + ( lo s b ) = f2(m 3 ) and thus A = 0(b) and the whole 
average time complexity (in terms of number of symbol reading and comparisons) turns out to be 

Qf nlogm \ 
\ m log log m I ' 
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5.1 Implementation details 

The tree T can actually be built in Ql ^g 1 "^ ) time by using appropriate data structures. Recall 

that the tree T recognizes all the factors of p r of length f Q g |°g ™ • To implement T, we use the same 
AC automaton presented in previous section to build the tree T, but with two differences: we only 
need forward transitions and the length of any pattern is bounded by i^gfogm • Thus the cost is 

upper bounded by 0( ^ g °^ n • t), where t is the time needed to do an operation on the predecessor 
data structure (maximum of the times needed for inserts/deletes and searches) We now turn our 
attention to the cost of the matching phase. From the previous section, we know that the total 
complexity in terms of number of symbol reading and comparisons is 

Q ( m"ogfogm )- The t0tal COSt 

of the matching phase is dominated by the multiplication of the total number of text symbols read 
multiplied by the cost of a transition in the AC automaton which itself is dominated by the time 
to do an operation on a predecessor data structure. The total cost of the matching phase is thus 
e>( m"ogfogm • t), where t is the time needed to do an operation on the predecessor data structure. 

Now the performance of both matching and building phases crucially depend on the used pre- 
decessor data structure. If a binary search tree is used then t = 0(log i g{^T m ) = O(loglogm) 
and the total matching time becomes 0{nt) = 0(n log log m), and the total building time becomes 
0{m log m). However, we can do better if we work in the word- RAM model. Namely, we can use 
the atomic-heap (see section [2]) which would add additional o(m) words of space and support all 
operations (queries, inserts and deletes) in constant time on sets of size log *- 1 - 1 m. In our case, we 
have a set of size 0( ^°f^ m ) an d thus the operations can be supported in constant time. We thus 
have the following theorem: 

Theorem 4 Searching for a consecutive motif p = p\. . . p m in a permutation t = t% . . .t n can be 



1 r\i mlogm , nlogm 

aone %n average u\^g 

m ' m log log m 



time. 
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